bugfix: refactor alerts to accomodate for single-node clusters #1010

rexagod · 2025-01-06T06:29:38Z

For the sake of brevity, let:

Q:  kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable)
QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested)

thus, both quota alert expressions relevant here (KubeCPUOvercommit and KubeMemoryOvercommit) exist in the form: sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0, which, in case of a single-node cluster (sum(Q) by (cluster) = max(Q) by (cluster)), is reduced to, sum(QQ) by (cluster) > 0, i.e., the alert will fire if any request limits exist.

To address this, drop the max(Q) by (cluster) buffer assumed in non-SNO clusters from SNO, reducing the expression to: sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requeted - total allocable > 0 to trigger alert), since there is only a single node, so a buffer of the same sort does not make sense.

rexagod · 2025-01-06T08:15:48Z

alerts/resource_alerts.libsonnet

              and
-              (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+              sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) -
+              sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)


Suggested change

sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)

0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)

Since a max(Q) buffer is not applicable in SNO, how about a numeric buffer of 5% (or more?)? That should help alert before things go out of budget.

alerts/resource_alerts.libsonnet

simonpasquier · 2025-01-09T08:36:35Z

alerts/resource_alerts.libsonnet

@@ -34,18 +34,34 @@
          } +
          if $._config.showMultiCluster then {
            expr: |||
-              sum(namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) by (%(clusterLabel)s) - (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) - max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+              (count(kube_node_info) == 1


wouldn't (count(kube_node_info) == 1 and ... mess up the returned value (e.g. it would always return 1)? Given the complexity of the expression, I'd advocate for some unit tests in the first place to assert the current rule.

I'll add some unit tests around this, but I'm not sure why this will always return 1?

foo == 1 and bar will always return 1 (the right-hand side is only used for label matching).

(count by (%(clusterLabel)s) (max by (%(clusterLabel)s, node) (kube_node_info)) == 1 and sum by (%(clusterLabel)s) (namespace_cpu:kube_pod_container_resource_requests:sum{%(ignoringOverprovisionedWorkloadSelector)s}) - sum by (%(clusterLabel)s) (kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) > 0)

Apologies if I'm missing something here, but this seems to have boolean expressions on both sides, similar to, say, vector(1) == 1 and vector(2) - vector(1) > 0 (scalar vector on RHS of and instead of an instant vector, so no label matching)?

If this was the aforementioned case, I would've preferred the bool operator to avoid the default filtering behavior, but it seemed to suffice without that.

Had a chat with Simon. We essentially need to swap the LHS and RHS here to make sure the value being returned makes sense for the annotation, as it's currently being used by it (which I missed to notice).

Additionally, we want to hold this a bit and talk a bit more about the usefulness of the alert. As in, in a non-SNO environment, the alert indicates that the user should make accommodations for more resources by, say, adding more nodes, but in SNO, it just ends up being more-or-less of an OOM indicator.

Had a chat with @qJkee, with the conclusion that we want to keep these alerts while modifying them for SNO, as "it still applies to SNO in case if user requests more resources than it has in the cluster. Plus, in that case we can suggest to the users to add worker nodes or increase resources on the given node."

Additionally, Bulat suggested that alerting earlier in case of SNO makes sense, and thus we'd want a ~85% threshold (as usually they are loaded that % and even more because they run on the edge) before firing it.

@simonpasquier I'll go ahead and make these changes, in addition to the ones above, but PLMK if I missed something.

rexagod

Plus, in that case we can suggest to the users to add worker nodes or increase resources on the given node.

LMK if I should append the above to the existing descriptions to make them more relevant in case of SNO.

skl

@rexagod I added tests to your PR in eca85fe, hope you don't mind.

Also tested in a dev environment and compared the old rules with the new, I observed effectively the same behaviour - with the addition of support for single-node clusters.

I think the existing alert descriptions are self-explanatory, and imply action is required (reduce requests or increase capacity) regardless of number of nodes.

lgtm!

rexagod · 2025-01-27T09:10:34Z

Ah, sorry I completely missed adding the tests in my last commit. Thanks a bunch for the contribution! ❤️

I'll ask Simon to take a look here since with this patch we are essentially setting the trend for adapting such alerts for SNO, instead of dropping them as we have done internally (till now).

simonpasquier · 2025-01-27T12:43:09Z

alerts/resource_alerts.libsonnet

+              max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0
+              and
+              (sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) -
+              max(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s)) > 0)
            ||| % $._config,
            annotations+: {
              description: 'Cluster {{ $labels.%(clusterLabel)s }} has overcommitted CPU resource requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.' % $._config,


I still feel that reusing the same alert for single node clusters creates confusion. For instance the alert description doesn't fit right in this case:

By definition, a single node cluster can't tolerate any node failure.

I should be able to use all the allocatable CPU without interfering with the system components (etcd, kube-api, kubelet, ...) so I'm not sure why we multiply by 0.85. E.g. the reserved CPU value is tuned to the right usage.

I thought about reserving the final 15% as a leeway for backgrounded processes that may belong to userland but not necessarily be initiated by the user, and fall in a "grey" space.

But I agree, this introduces an opinionated constant threshold that doesn't make sense if we look at the metrics and alert definitions. Like you said, "allocatable" should mean that till 100%, and "overcommitment" should quite literally mean exceeding that limit. I'll drop the threshold.

Talking with Balut, I'm inclined to believe that the users will expect the alert to adapt to SNO, as in, if the requests simply exceed the allocatable resources. It may additionally be ambiguous to introduce new alerts for SNO as a 1+1 multi-node system may stop firing AlertX and start firing AlertYSNO when it's reduced to SNO, since the latter is arguably a derivative of the former to some degree (as some users may expect).

(we could of course explore the possibility of adding SNO-exclusive alerts, but I just wanted to put these points out there, I'm on the fence mostly)

To add on, I think we also need to revise how we handle SNO downstream, as in, at the moment we recognize SNO by a SingleReplica topology infrastructure, however, SNO can technically have more than one node (SNO+1 configurations), which would be in-line with having a different set of alerts for SNO while ensuring that upstream expects the same. I'll see if I can find any SNO-dedicated teams or people who can additionally shed some light here.

@rexagod how would you like to proceed with this PR? I've held back from merging as it seems there's open discussion still.

Apologies for the delay here, we are talking this through internally and I'll update as soon as there's a resolution.

FYI the tests have moved into tests/ directory since:

tests: add more tests #1002

For the sake of brevity, let: Q: kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"} (allocable), and, QQ: namespace_cpu:kube_pod_container_resource_requests:sum{} (requested), thus, both quota alerts relevant here exist in the form: sum(QQ) by (cluster) - (sum(Q) by (cluster) - max(Q) by (cluster)) > 0 and (sum(Q) by (cluster) - max(Q) by (cluster)) > 0, which, in case of a single-node cluster (sum(Q) by (cluster) = max(Q) by (cluster)), is reduced to, sum(QQ) by (cluster) > 0, i.e., the alert will fire if *any* request limits exist. To address this, drop the "max(Q) by (cluster)" buffer assumed in non-SNO clusters from SNO, reducing the expression to: sum(QQ) by (cluster) - sum(Q) by (cluster) > 0 (total requeted - total allocable > 0 to trigger alert), since there is only a single node, so a buffer of the same sort does not make sense. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

Use kube_node_role{role="control-plane"} (see [1] and [2]) to estimate if the cluster is HA or not. * [1]: https://github.com/search?q=repo%3Akubernetes%2Fkube-state-metrics%20kube_node_role&type=code * [2]: https://kubernetes.io/docs/reference/labels-annotations-taints/#node-role-kubernetes-io-control-plane Also drop any thresholds as they would lead to false positives.

rexagod · 2025-02-10T13:58:57Z

@simonpasquier I've updated the PR to reflect only what upstream is concerned with, i.e., HA and non-HA, leaving logic for any OpenShift-specific variants outside of this patch.

I believe following this trend upstream will be vendor-neutral and future-safe.

rexagod · 2025-02-10T21:50:37Z

I noticed the tests are failing, will fix.

rexagod force-pushed the KubeCPUOvercommit-SNO branch from 5b96fb5 to 5cd53d6 Compare January 6, 2025 07:07

rexagod marked this pull request as ready for review January 6, 2025 07:08

rexagod requested review from povilasv and skl as code owners January 6, 2025 07:08

rexagod marked this pull request as draft January 6, 2025 07:17

rexagod marked this pull request as ready for review January 6, 2025 08:13

rexagod commented Jan 6, 2025

View reviewed changes

rexagod mentioned this pull request Jan 7, 2025

MON-4200: disable --auto-gomemlimit for Prometheus on SNO until we can ensure it won't result in excessive CPU usage openshift/cluster-monitoring-operator#2549

Open

2 tasks

skl reviewed Jan 7, 2025

View reviewed changes

alerts/resource_alerts.libsonnet Outdated Show resolved Hide resolved

alerts/resource_alerts.libsonnet Show resolved Hide resolved

skl self-assigned this Jan 7, 2025

simonpasquier reviewed Jan 9, 2025

View reviewed changes

skl removed their assignment Jan 13, 2025

rexagod commented Jan 21, 2025

View reviewed changes

rexagod requested a review from simonpasquier January 21, 2025 13:39

rexagod force-pushed the KubeCPUOvercommit-SNO branch from 984a0a3 to 3cd8f3c Compare January 21, 2025 19:33

skl approved these changes Jan 23, 2025

View reviewed changes

simonpasquier reviewed Jan 27, 2025

View reviewed changes

skl added the keepalive Use to prevent automatic closing label Feb 3, 2025

rexagod and others added 4 commits February 10, 2025 19:17

fixup! bugfix: refactor alerts to accomodate for single-node clusters

5788ff9

test: Add tests for KubeCPUOvercommit and KubeMemoryOvercommit

41645e2

rexagod force-pushed the KubeCPUOvercommit-SNO branch from 77dc076 to fc9fe6a Compare February 10, 2025 13:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bugfix: refactor alerts to accomodate for single-node clusters #1010

bugfix: refactor alerts to accomodate for single-node clusters #1010

rexagod commented Jan 6, 2025 •

edited

Loading

rexagod Jan 6, 2025

simonpasquier Jan 9, 2025

rexagod Jan 9, 2025

simonpasquier Jan 10, 2025

rexagod Jan 12, 2025 •

edited

Loading

rexagod Jan 13, 2025

rexagod Jan 20, 2025

rexagod left a comment •

edited

Loading

skl left a comment

rexagod commented Jan 27, 2025

simonpasquier Jan 27, 2025

rexagod Jan 27, 2025

rexagod Jan 27, 2025

rexagod Jan 28, 2025

skl Feb 3, 2025

rexagod Feb 4, 2025

skl Feb 7, 2025

rexagod commented Feb 10, 2025

rexagod commented Feb 10, 2025

	sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)
	0.95 * sum(kube_node_status_allocatable{%(kubeStateMetricsSelector)s,resource="cpu"}) by (%(clusterLabel)s) > 0)

bugfix: refactor alerts to accomodate for single-node clusters #1010

Are you sure you want to change the base?

bugfix: refactor alerts to accomodate for single-node clusters #1010

Conversation

rexagod commented Jan 6, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rexagod Jan 12, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rexagod left a comment • edited Loading

Choose a reason for hiding this comment

skl left a comment

Choose a reason for hiding this comment

rexagod commented Jan 27, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rexagod commented Feb 10, 2025

rexagod commented Feb 10, 2025

rexagod commented Jan 6, 2025 •

edited

Loading

rexagod Jan 12, 2025 •

edited

Loading

rexagod left a comment •

edited

Loading